1. Introduction


The  Programmable  Image  Processing  Element  (PIPE)  program,  which 
was  conducted  by  Rockwell  International  from  September  1979  through 
September  1980,  consisted  of  three  actual  phases: 

1.  The  definition  of  a  programmable  device  which  could 
form  the  inner  product  of  a  pair  of  16-element 
vectors; 

2.  the  definition  of  a  programmable  device  which  could 
form  the  inner  product  of  a  pair  of  9-element 
vectors; 

3.  the  partial  logical  design,  simulation,  and  layout 
of  the  device  defined  in  2. 

Although  a  proposal  was  submitted  to  carry  the  program  through  the 
making  and  delivery  of  optomasks,  that  proposal  was  not  accepted 
due  to  lack  of  funds. 

This  report  contains  four  chapters.  Chapter  2  addresses  the  rationale 
for  the  existence  of  the  PIPE,  Chapter  3  describes  the  device  which 
was  defined  to  combine  a  pair  of  16-element  vectors,  and  Chapter  4 
describes  the  device  which  was  defined  to  combine  a  pair  of  9-element 
vectors. 

The summary of the device design effort is contained in the Appendix.

2. Why PIPE?


2.1  Signal  Processing  Requirements  of  DoD 

Within any digital communications, radar, sonar, or control system we encounter innumerable requirements that take the form known as a digital filter. To the signal, this function is analogous to the familiar analog filters such as bandpass, low-pass, or high-pass. These filters are described by equations of the form

$$y_k = \sum_{n=0}^{N} a_n x_{k-n} + \sum_{n=1}^{M} b_n y_{k-n}$$

where $y_k$ is the output data sequence, $x_k$ is the input data sequence, and the $a_n$ and $b_n$ coefficients describe the filter characteristic. Modulators are described by equations of a similar but degenerate form

$$y_k = m_k x_k$$

where $m_k$ is the modulation sequence. Multiplexers are described by equations of the form

$$y_k = \sum_{\ell=1}^{L} m_{k\ell}\, x_{k\ell}$$

where one $m_{k\ell}$ at a time is unity, the others are zero, and $x_{k\ell}$ is the $\ell$th input at time $k$.

Image processing requirements are frequently of the form

$$y_{mn} = \sum_{j} \sum_{k} a_{jk}\, x_{m-j,\,n-k}$$

as are many forms of one- and two-dimensional transforms. This primal computational form has arisen so frequently in signal-processing systems that the efficient mechanization of this computation is seen as a vital step in DoD's quest for high-speed signal processing.

Mechanization of signal-processing functions is painfully slow when faced with high-speed data for real-time processing. Digital signal-processing systems can be made faster and less costly if the normally employed arithmetic operations are streamlined.

For  instance,  a  high  resolution  875-line  video  system  with  a  4:3  aspect 
ratio  and  8-bit  amplitude  resolution  provides  about  8.17  million  bits  per 
frame.  At  a  frame  rate  of  60  per  second,  a  3-color  display  now  requires  a 
data rate of 1.47 Gigabits/sec. Image enhancement calculations at that
speed  seem  impossible  by  today's  standards. 
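
The arithmetic behind these figures can be checked with a short sketch. It assumes square pixels (so the line length is 4/3 of the line count, rounded) and that all 875 lines carry picture data; neither assumption is stated in the text, and the variable names are illustrative only.

```python
# Sketch of the data-rate arithmetic for the 875-line video example.
# Assumptions (not from the report): square pixels, all 875 lines active.
lines = 875
pixels_per_line = round(lines * 4 / 3)      # 4:3 aspect ratio
bits_per_pixel = 8                          # 8-bit amplitude resolution
frames_per_second = 60
colors = 3

bits_per_frame = lines * pixels_per_line * bits_per_pixel
data_rate = bits_per_frame * frames_per_second * colors

print(f"{bits_per_frame / 1e6:.2f} Mbit/frame, {data_rate / 1e9:.2f} Gbit/s")
```

Under these assumptions the sketch lands on the same rounded values quoted in the paragraph above.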

Special  signal-processing  devices  and  organizations  must  be  developed  to 
meet  the  increasingly  difficult  signal-processing  requirements.  The 
development of the programmable image-processing element is an effort to
evolve  a  programmable  microprocessor  type  of  device  which  is  specifically 
a  signal  processor  for  high  data  rate  applications. 

The  most  frequently  required  image-processing  function  is  that  of  a 
sliding  window.  The  input  is  a  3x3  array  of  picture  elements,  or  pixels. 
Each  pixel  is  weighted  by  a  coefficient,  and  the  sum  of  products  is  the 
output,  usually  assigned  to  a  location  in  output  space  which  corresponds 
to  the  center  of  the  3x3  array. 
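
As a minimal software sketch of the sliding-window function just described, the following forms the weighted sum over each 3x3 neighborhood and writes it to the center location. The function name and the choice to leave border pixels at zero are assumptions made for illustration; the report does not specify border handling.

```python
def sliding_window_3x3(image, weights):
    """Weighted 3x3 sum of products, assigned to the window center.

    image   : list of rows of numbers
    weights : 3x3 list of coefficients
    Border outputs are left at zero (an assumption of this sketch).
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            acc = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += weights[dr + 1][dc + 1] * image[r + dr][c + dc]
            out[r][c] = acc
    return out
```

Each output pixel is a nine-term inner product, which is the operation the PIPE device is intended to perform in hardware.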

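Edge enhancement can be performed with an approximation to the bi-Laplacian operator, $\partial^4(\cdot)/\partial x^2 \partial y^2$, as described in reference 1:

$$\frac{\partial^4}{\partial x^2\,\partial y^2} \approx \begin{bmatrix} 1 & -2 & 1 \\ -2 & 4 & -2 \\ 1 & -2 & 1 \end{bmatrix}$$
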

For directional edge information there exist several means to approximate the partial derivative in the direction of interest, as shown in Figures 1 and 2, which have been taken from reference 2.

These masks are but a sampling of the variety of window functions. Integration can also be performed by approximating a sidelobe-suppressing mean operator:


The  purpose  of  this  section  has  been  to  illustrate  the  versatility  of 
and  the  need  for  inner-product  operators  for  signal  processing. 
Figure 1. Examples of Compass Gradient Masks

Figure 2. The Eight Principal Directions on a 3x3 Grid

G.S. Robinson, "Detection and Coding of Edges Using Directional Masks," SPIE Vol. 87, Advances in Image Transmission Techniques (1976), pp. 117-125.
3. Recommended Architecture
3.1 Principle of Operation

In the preceding chapter we demonstrated our fundamental premise: most signal-processing tasks can be expressed as a vector dot (or inner, or scalar) product, e.g.:

$$p = x_1 y_1 + x_2 y_2 + \cdots + x_J y_J = \sum_{j=1}^{J} x_j y_j = \mathbf{x}^T \mathbf{y}$$

where

$$\mathbf{x} = \mathrm{col}(x_1, x_2, \ldots, x_J)$$

and

$$\mathbf{y} = \mathrm{col}(y_1, y_2, \ldots, y_J)$$



As Peled and Liu observed, if we consider the $x_j$ as being composed of $K$ amplitude bits and a sign bit, the $x_j$ can be expressed as fractional values:

$$x_j = \sum_{k=0}^{K} a_{jk}\, 2^{-k}$$

Therefore, the inner product may be written as

$$p = \sum_{j=1}^{J} \sum_{k=0}^{K} a_{jk}\, y_j\, 2^{-k} = \sum_{k=0}^{K} \left[ \sum_{j=1}^{J} a_{jk}\, y_j \right] 2^{-k} = \sum_{k=0}^{K} q_k\, 2^{-k}$$

where

$$q_k = \sum_{j=1}^{J} a_{jk}\, y_j$$
The above expressions imply two facts:

Each $q_k$ may be generated by one table-lookup operation in a $2^J - 1$ word memory, where the word length is $W = \lceil \log_2 J \rceil$ plus the word-length requirement on an individual $y_j$.

$p$ may be generated by $K$ shift-and-add (S&A) operations.

These equations can be mechanized by the simple configuration which is shown below.
Figure 3. Basic ROM-Accumulator Structure

where the organization of the ROM is indicated below in Table 1 for the J=3 case.

TABLE 1. BASIC MEMORY STRUCTURE & ORGANIZATION

INPUT PATTERN EQUALS
MEMORY ADDRESS            MEMORY
x1    x2    x3            CONTENT

0     0     0             0
0     0     1             y3
0     1     0             y2
0     1     1             y2+y3
1     0     0             y1
1     0     1             y1+y3
1     1     0             y1+y2
1     1     1             y1+y2+y3
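
The ROM-accumulator idea can be illustrated with a small software model. This is a sketch under stated assumptions rather than the report's circuit: the x words are treated as unsigned (the sign bit is ignored), exact rational arithmetic stands in for the accumulator, and the names `build_rom` and `inner_product_da` are hypothetical.

```python
from fractions import Fraction

def build_rom(y):
    """ROM contents as in Table 1: address bits select which y_j are summed.
    The bit belonging to x_1 is the most significant address bit."""
    J = len(y)
    rom = []
    for addr in range(2 ** J):
        rom.append(sum(y[j] for j in range(J) if (addr >> (J - 1 - j)) & 1))
    return rom

def inner_product_da(x_bits, rom):
    """Form p = sum_k q_k 2^-k by table lookup plus shift-and-add.

    x_bits[j][k] is bit a_jk of x_j, with k = 0 carrying weight 2^0.
    Bits are processed least significant first; each step halves the
    accumulator (the shift) and adds the looked-up q_k.
    """
    J = len(x_bits)
    n_bits = len(x_bits[0])
    acc = Fraction(0)
    for k in reversed(range(n_bits)):
        addr = 0
        for j in range(J):
            addr = (addr << 1) | x_bits[j][k]
        acc = acc / 2 + rom[addr]
    return acc

# Usage: y = (3, 5, 7); x = (1.25, 0.75, 1.0) written as bits 1.01, 0.11, 1.00
rom = build_rom([3, 5, 7])
p = inner_product_da([[1, 0, 1], [0, 1, 1], [1, 0, 0]], rom)
```

The design point worth noting is that no multiplier appears anywhere: the coefficients enter only through the precomputed table, and the data enter only as addresses.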


If our problem were initially one of convolution, such as

$$p_h = \sum_{j=1}^{J} x_{h-j}\, y_j$$

then

$$p_h = \sum_{i} \sum_{j} \sum_{k} a_{h-j,k}\, b_{j,i-k}\, 2^{-i}$$

where the $a$ and $b$ are the bits of $x$ and $y$, respectively. This is two-dimensional convolution. Three facts can be drawn from this:

1. Looking at the problem as one of bit manipulation raises the order of convolution by only one. (The generalization can be rigorously justified.)

2. Convergence is uniform, so we can perform the summations in any order without affecting the answer.

3. This heavily impacts computing time and complexity of hardware.


Let's return our attention to the basic inner-product statement:

$$p = \sum_{i} \sum_{j} \sum_{k} a_{jk}\, b_{j,i-k}\, 2^{-i}$$

We could mechanize this by summing over j, then over k

Figure 4. Block Multiplier Mechanization (diagram annotation: tree of adders; for large J, these are quite complex)
or we could mechanize this by summing over k, then over j

Figure 5. Conventional Mechanization (diagram annotation: for large J, this is quite a mess too)

Neither solution is satisfactory. Each solution represents an extreme.

If we explore summing over part of j, several times, then over k, then over the remainder of j (which we can do because the order of operations does not affect the answer), we have an interesting hybrid solution.


3.2 The Search for an Optimum Answer

Suppose

1. We use M table-lookup multipliers

2. Each block multiplier operates on vectors of dimension N

Since the total dimension of the vectors in the problem is J, then J = MN. A product accuracy of K + 1 bits will be maintained, so

$$p = \sum_{j=1}^{J} x_j y_j = \sum_{m=1}^{M} p_m$$

where $p_m$ is the output of the $m$th block multiplier

$$p_m = \sum_{n=1}^{N} x_{n+N(m-1)}\; y_{n+N(m-1)}$$

and each x can be decomposed into its constituent bits:

$$x_{n+N(m-1)} = \sum_{k=0}^{K} a(n,m,N,k)\, 2^{-k}$$

The new system which we are examining is shown below in Figure 6.

Figure 6. Candidate Hybrid Configuration for Inner-Product Generation
We have much more powerful processing means if both x and y components are free variables. Let's go back to the inner-product statement

$$p = \sum_{j=1}^{J} x_j y_j$$

where x was composed of K + 1 bits:

$$x_j = \sum_{k=0}^{K} a_{jk}\, 2^{-k}$$

Now y can be similarly composed of L + 1 bits:

$$y_j = \sum_{\ell=0}^{L} b_{j\ell}\, 2^{-\ell}$$

so

$$p = \sum_{j} \sum_{k} \sum_{\ell} a_{jk}\, b_{j\ell}\, 2^{-(k+\ell)}$$

If we define a new exponent of 2

$$i \triangleq k + \ell$$

the inner-product statement becomes

$$p = \sum_{i} \sum_{j} \sum_{k} a_{jk}\, b_{j,i-k}\, 2^{-i}$$

which is one-dimensional convolution.


Now we need a computational-complexity measure.

A minimum-complexity mechanization for a serial/parallel multiplier requires K full adders, plus some overhead logic. We will use this to construct our computational complexity measure.

Conventional mechanization of $p = \sum_{j=1}^{J} x_j y_j$ therefore requires

K adders/product x J products + [J-1] adders to sum products = KJ+J-1 adders.

A block multiplier requires K adders per block multiplier for the shift-and-add section, plus the adder tree.

The number of adders required to combine N inputs into sets of S is $\binom{N}{S}$; therefore, the total number of adders in the adder tree of a block multiplier structure is

$$\sum_{S=2}^{N} \binom{N}{S} = \sum_{S=0}^{N} \binom{N}{S} - (N+1) = 2^N - (N+1).$$

The total number of adders for the hybrid collection-of-block-multipliers approach is

[K + 2^N - (N+1)] adders/multiplier x M multipliers + [M-1] adders to sum products.

Table 2 explores the boundary conditions of mechanization approach vs the dimension of the vector for two sample word lengths, 8 bits and 16 bits.
Table 2 - BOUNDARY CONDITIONS

COMPARISON OF MULTIPLIER COMPLEXITY
(NUMBER OF ADDERS REQUIRED)

                 M=1; N=J                M=J; N=1
Dimension of     Block Multiplier        Conventional
Vectors, J       K+2^J-(J+1)             J(K+1)-1
                 8 Bits     16 Bits      8 Bits     16 Bits

 2                  8          16          15          31
 3                 11          19          23          47
 4                 18          26          31          63
 5                 33          41          39          79
 6                 64          72          47          95
 7                127         135          55         111
 8                254         262          63         127
 9                509         517          71         143
10               1020        1028          79         159
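
The adder counts in Table 2 follow directly from the three expressions given earlier, and a short sketch makes the boundary cases easy to check. The function names are illustrative; K is the number of amplitude bits, so an 8-bit word corresponds to K = 7 and a 16-bit word to K = 15.

```python
def adders_conventional(J, K):
    """K adders per product, J products, J-1 adders to sum the products."""
    return K * J + J - 1

def adders_block(J, K):
    """Single block multiplier: K shift-and-add adders plus the adder tree."""
    return K + 2 ** J - (J + 1)

def adders_hybrid(M, N, K):
    """M block multipliers of dimension N, plus M-1 adders to sum outputs."""
    return (K + 2 ** N - (N + 1)) * M + (M - 1)

# Reproduce the rows of Table 2 for 8-bit and 16-bit words.
for J in range(2, 11):
    print(J, adders_block(J, 7), adders_block(J, 15),
          adders_conventional(J, 7), adders_conventional(J, 15))
```

The hybrid expression reduces to the block-multiplier count when M = 1 and to the conventional count when N = 1, which is why those two columns are described as boundary conditions.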
We see that an optimum solution lies between 1<M<J and 1<N<J.

The number of adders used in one hybrid solution is:

$$A = [K + 2^N - (N+1)]M + [M-1] = [K + 2^N - N]M - 1$$

Our formal minimization procedure is the following:

Normalize with respect to J = MN

Define the total number of bits as B = K+1

For computational convenience, define the auxiliary variable

$$\hat{A} = \frac{A+1}{J} = \frac{B - 1 + 2^N}{N} - 1$$

and minimize $\hat{A}$ with respect to N. Now, let us compare this result with that of conventional mechanization:

$$A_c = KJ + J - 1 = (B-1)J + J - 1 = BJ - 1$$

For computational simplicity we will describe another auxiliary variable:

$$\hat{A}_c = \frac{A_c + 1}{J} = B$$

Notice that we cannot compare this with straight block multiplier mechanization using the same formulation because the parameter J cannot be removed by normalizing, i.e.:

$$A_b = K + 2^J - (J+1) = B + 2^J - J - 2$$

$$\hat{A}_b = \frac{A_b + 1}{J} = \frac{B + 2^J - 1}{J} - 1$$
A comparison of $\hat{A}$ for various values of B and N is shown in Table 3. The N=1 column is identically the same as $\hat{A}_c$. For word lengths of 4 bits the optimum combination is 2-element vectors; for word lengths of 8, 12, and 16 bits, the optimum combination is 3-element vectors; for word lengths of 20, 24, 28, 32, and 36 bits the optimum combination is 4-element vectors.


TABLE 3. Â FOR VARIOUS B, N VALUES

 B      N=1 (Â_c)    N=2        N=3        N=4        N=5

 4       4            2 1/2      2 2/3      3 3/4      6
 8       8            4 1/2      4          4 3/4      6 4/5
12      12            6 1/2      5 1/3      5 3/4      7 3/5
16      16            8 1/2      6 2/3      6 3/4      8 2/5
20      20           10 1/2      8          7 3/4      9 1/5
24      24           12 1/2      9 1/3      8 3/4     10
28      28           14 1/2     10 2/3      9 3/4     10 4/5
32      32           16 1/2     12         10 3/4     11 3/5
36      36           18 1/2     13 1/3     11 3/4     12 2/5
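
The entries of Table 3 can be regenerated from the normalized measure, taken here as (B - 1 + 2^N)/N - 1, which is the form consistent with the tabulated values. The sketch below uses exact fractions; the function names and the candidate range N = 1 to 5 are choices made for illustration.

```python
from fractions import Fraction

def normalized_adders(B, N):
    """Normalized adder count (A + 1) / J for block size N and word length B."""
    return Fraction(B - 1 + 2 ** N, N) - 1

def best_block_size(B, candidates=range(1, 6)):
    """Block size N with the smallest normalized adder count."""
    return min(candidates, key=lambda N: normalized_adders(B, N))

def penalty_percent(B, N_used, N_best):
    """Extra normalized adders, in percent, for using N_used instead of N_best."""
    ratio = normalized_adders(B, N_used) / normalized_adders(B, N_best)
    return round(100 * (ratio - 1))

for B in (4, 8, 12, 16, 20, 24, 28, 32, 36):
    print(B, [str(normalized_adders(B, N)) for N in range(1, 6)],
          "best N =", best_block_size(B))
```

The same `penalty_percent` helper, evaluated for N = 4 against N = 3, gives the percentages that the report later tabulates for the 4x4 organization.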

Shown in Table 4 is a tabulation of $\hat{A}_b$ for various values of B and J. Not surprisingly, if J is replaced by N, Table 4 contains the same numerical entries as Table 3, validating again our conclusion of what constitutes a minimum-complexity multiplier.

TABLE 4. Â_b FOR VARIOUS B, J VALUES (only one row of the table is legible: 13 1/3, 11 3/4, 12 2/5, 15 1/2, 22 2/7, 35 3/8, 59 7/9, 104 9/10)



A  Solution  Based  on  Applications 

The  following  set  of  observations  appears  to  be  valid: 

For word lengths of 8, 12, and 16 bits, 3x3 table-lookup multipliers are most efficient.

For word lengths greater than 16 bits, 4x4 table-lookup multipliers are most efficient. If, however, we used the 3x3 table-lookup multipliers for the greater word-length cases, the percentage of extra adders (over the optimum 4x4) is as shown in Table 5.


TABLE  5  ADDER  REQUIREMENTS 
FOR  3x3  ORGANIZATION 


We  may  therefore  conclude  that  the  penalty  is  small  for  the 
advantages  gained  by  using  the  standardized  3x3  table  lookup 
multiplier  structure. 

On  the  other  hand,  if  we  use  the  4x4  table-lookup  multiplier 
for  the  shorter  word  lengths,  the  penalty  paid  in  terms  of 
additional  adders  over  the  3x3  table-lookup  multiplier  is 
shown  below  in  Table  6. 


TABLE  6  ADDER  REQUIREMENTS 
FOR  4x4  ORGANIZATION 


WORD LENGTH, B      % OF ADDITIONAL
                    NORMALIZED ADDERS

 4                  41
 8                  19
12                   8
16                   1


We have to look now at applications in order to decide if we should use a 3x3 multiplier as standard or a 4x4 multiplier.

The 4x4 multiplier is a natural building block to perform a 2D transform of dimension $2^N \times 2^N$ where N is an integer.

The  4x4  multiplier  is  also  the  natural  building  block  to  perform  a  complex 
FFT  butterfly  computation.  The  decimate-in-frequency  butterfly  is  described  by 

x  =  u  +  v 
y  =  wz 
z  =  u  -  v 

where  the  complex  inputs  are  u  and  v,  the  complex  outputs  are  x  and  y,  the 
complex  weighting  coefficient  is  w,  and  z  is  an  auxiliary  variable.  Then 


which  can  be  performed  handily  by  our  4x4  structure.  The  decimate-in- 
time  butterfly  is  described  by 

x  =  u  +  p 
y  =  u  -  p 
p  =  vw 

where  u,  v,  w,  x,  and  y  are  as  described  above  and  p  is  an  auxiliary  variable. 
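Expanding into real (subscript r) and imaginary (subscript i) parts,

$$x_r = u_r + v_r w_r - v_i w_i$$
$$x_i = u_i + v_r w_i + v_i w_r$$
$$y_r = u_r - v_r w_r + v_i w_i$$
$$y_i = u_i - v_r w_i - v_i w_r$$

or

$$\begin{bmatrix} x_r \\ x_i \\ y_r \\ y_i \end{bmatrix} = \begin{bmatrix} 1 & 0 & w_r & -w_i \\ 0 & 1 & w_i & w_r \\ 1 & 0 & -w_r & w_i \\ 0 & 1 & -w_i & -w_r \end{bmatrix} \begin{bmatrix} u_r \\ u_i \\ v_r \\ v_i \end{bmatrix}$$
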

which  again  can  be  performed  handily  by  the  4x4  structure. 
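
To see concretely why a set of four 4-element inner products suffices for the decimate-in-time butterfly, the following sketch builds the 4x4 real matrix from the weighting coefficient and applies it row by row. The helper names are hypothetical, and ordinary floating-point arithmetic stands in for the fixed-point hardware.

```python
def dit_butterfly_matrix(w):
    """4x4 real matrix mapping (u_r, u_i, v_r, v_i) to (x_r, x_i, y_r, y_i)."""
    wr, wi = w.real, w.imag
    return [
        [1, 0,  wr, -wi],
        [0, 1,  wi,  wr],
        [1, 0, -wr,  wi],
        [0, 1, -wi, -wr],
    ]

def inner_product(row, vec):
    """One 4-element inner product, the unit of work of a block multiplier."""
    return sum(a * b for a, b in zip(row, vec))

def dit_butterfly(u, v, w):
    """Compute x = u + v*w and y = u - v*w as four real inner products."""
    vec = [u.real, u.imag, v.real, v.imag]
    xr, xi, yr, yi = (inner_product(row, vec) for row in dit_butterfly_matrix(w))
    return complex(xr, xi), complex(yr, yi)

# Usage with values that are exact in binary floating point
x, y = dit_butterfly(1 + 2j, 3 - 1j, 0.5 + 0.25j)
```

Each matrix row is one coefficient vector, so the four outputs map onto four vector multipliers sharing the same four data words.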


A  set  of  4  such  vector  multipliers  can  also  be  used  directly  as  a 
general  4x4  coordinate  transformer  or  as  a  matrix  multiplier,  or 
to  mechanize  a  filter  as  described  below  in  Table  7. 


TABLE 7 FUZZ-PHRASE FILTER LIST
(Pick a line from each category, A thru G, to describe function)

A.
1. Fixed Coefficient
2. Programmed time-variable parameter
3. Adaptive

B.
1. Linear
2. Programmed Nonlinear

C.
1. Recursive
2. Transversal

D.
1. Fixed Structure
2. Variable Order (up to 16 taps)

E.
1. One-Dimensional
2. Two-Dimensional

F.
1. Low-pass
2. High-pass
3. Bandpass
4. Band reject
5. All pass
6. Arbitrary spectrum shaping

G.
1. Filter
2. Equalizer
The same architecture can also be programmed to function as a

1.  Modulator/Demodulator  (set  of  4) 

2.  Coordinate  Transformer  (up  to  4  rotations) 

3.  Polynomial  Function  Generator 

4.  Element  of  Pattern  Classifier 

5.  Multiplexer  (4-4:1  or  2-8:1  or  1-8:1  and  2-4:1  or  1-12:1  and 
1-4:1  or  1-16:1) 

6.  Matched  Filter 

7.  Edge-Extraction  Mask 

8.  Sobel  Operator 

9.  Cosine  Transformer  (4x4,  or  8x2,  or  16x1) 

10.  Hadamard  Transformer 

11.  Unsharp  Masking 

12.  Despike  Element 

13.  Etc. 


For  Operations  in:  Mapmatching 

Midcourse  Updating 
Doppler  Radar  and  processing 

Target  detection,  identification,  tracking,  cueing 

Aimpoint  selection 

Correlation 

Data  windowing 

Filtering  of  signals 

Sonar  spectrum  analysis 

Inertial  platform  stabilization 

Instrument  caging 

Flight  control  stability  augmentation 
Adaptive  noise  cancelling 
Speech  enhancement 

Adaptive  line  enhancement/cancellation 

Channel  equalization  for  data  modems 

Data  compaction/AJ  protection 

FLIR  Display  Systems 

Dynamic  range  compression  functions 

Autothresholding  for  video  noise  limiting 

Chroma separation for digital video

Pattern  recognition  systems 

Automatic  fingerprint  classification 

Optical  character  reader 

Nonlinear  noise  filter 

Thin-fill  2D  data  reduction 
3.5 A Candidate Element

The skeleton of a basic signal processing section based on the foregoing is shown in Figure 7. This circuit mechanizes in each of 4 sections the following function

$$y_k = \sum_{j=0}^{3} a_{jk}\, x_{jk}$$

where the $x_{jk}$ may be independent (general inner product), or successive samples of an input (transversal filter), or some may be successive samples of the output (recursive filter). The coefficients $a_{jk}$ may be wholly replaced each sample time, or incrementally updated (adaptive filter).

Figure 7. Skeleton of Signal Processing Section
(1 of 4 sections on a chip)

The 4 data words address the partial-product memory as in Section 3.1. The contents of the partial-product memory were obtained by combining the coefficients through the adder tree.
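Büttner and Schüssler have shown that since $x_{jk}$ can be expressed in terms of its N bits, $b_{jkn}$, then for 2's-complement format,

$$x_{jk} = -b_{jk0} + \sum_{n=1}^{N-1} b_{jkn}\, 2^{-n}$$

and since

$$x_{jk} = \tfrac{1}{2}\left[x_{jk} - (-x_{jk})\right] = \tfrac{1}{2}\left[-(b_{jk0} - \bar{b}_{jk0}) + \sum_{n=1}^{N-1} (b_{jkn} - \bar{b}_{jkn})\, 2^{-n} - 2^{-(N-1)}\right]$$

then

$$y_k = \tfrac{1}{2}\left[q(0)\, 2^{-(N-1)} + \sum_{n=1}^{N-1} q(\{b_{jkn}\})\, 2^{-n} - q(\{b_{jk0}\})\right]$$

where

$$q(n) = q(\{b_{jkn}\}) = \sum_{j=0}^{3} a_{jk}\,(b_{jkn} - \bar{b}_{jkn}), \qquad n \in [0, 7]$$

$$q(n) = -q(15-n), \qquad n \in [8, 15]$$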


An  extremely  efficient  streamlined  mechanization  is  shown  below.  The  a's 
are  combined  to  generate  all  possible  q(n)  as  shown  in  the  adder  tree  of 
Figure  8. 


Figure 8. Adder-tree Section
The a's are 16-bit numbers, but the q(n) are 18-bit numbers since q(n)
can  be  as  large  as  the  sum  of  4  a's.  We  want  3  sets  of  "q"  registers; 
one  as  the  momentary  working  set,  and  one  or  two  being  loaded  as  shown 
in  Figure  9. 


Figure 9. Partial-Product Register Section

The outputs which feed the adder/subtractor are then summed (or differenced) according to $y_k = \sum q(n)\, 2^{-n}$.

The selection of a particular q(n) is according to the bits in $x_{jk}$, i.e., the $b_{jkn}$. The decoding network (partial-product register selector) is shown in Figure 10.

Figure 10. Partial-Product Register Selector

The  circuit  of  Figure  10  is  driven  by  the  data-input-register  section  which  is 
shown  in  Figure  11.  The  4  shift  registers  may  each  be  driven  from  a  serial 
input  path,  chained  with  the  preceding  register  or  loaded  from  the  output. 

These  3  options  for  each  of  4  registers  give  12  control  states  which  are 
addressed  by  4  function-select  lines. 

The  x  outputs  from  the  shift  registers  may  also  be  used  to  provide  the  "internal" 
signals  to  the  coefficient  update  register  section  which  is  shown  in  Figure  12. 

This is to facilitate the adaptive filter update computation

$$a_{jk} = a_{j(k-1)} + \mu\, x_{jk}\, \mathrm{sgn}(e_k)$$

The $\mu$ is mechanized by the shift. Since the update must be formed before the error signal, $e_k$, is available, one update-register section assumes $e_k > 0$, the other assumes $e_k < 0$. The correct q-register set is chosen after the computation of $e_k$ has been completed. This is the reason for two sets of registers being loaded simultaneously.
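
The speculative update scheme can be sketched in a few lines. This is an illustrative model, not the report's circuit: coefficients and data are plain integers, the step size is taken as a power of two implemented by an arithmetic right shift, and a zero error is treated as positive, which the text does not specify. The function names are hypothetical.

```python
def speculative_updates(a_prev, x, shift):
    """Form both candidate coefficient sets before the error sign is known.

    a_prev : previous coefficients a_j(k-1)
    x      : current data words x_jk
    shift  : step size mu = 2**-shift, applied as an arithmetic right shift
    Returns (set assuming e_k > 0, set assuming e_k < 0).
    """
    step = [xj >> shift for xj in x]
    assume_positive = [a + s for a, s in zip(a_prev, step)]
    assume_negative = [a - s for a, s in zip(a_prev, step)]
    return assume_positive, assume_negative

def select_update(assume_positive, assume_negative, error):
    """Pick the register set matching the sign of the error once it is known."""
    return assume_positive if error >= 0 else assume_negative

# Usage
plus, minus = speculative_updates([10, 20], [8, -16], shift=2)
chosen = select_update(plus, minus, error=-3)
```

Computing both candidates in parallel trades duplicated registers for latency: the selection is a simple multiplex once the error sign arrives.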

A basic signal-processing section based upon the foregoing is shown in Figure 13.

Figure 12. Coefficient-Update-Register Section (control labels: INPUT SELECT, SHIFT SELECT, ACCUM, ADD/SUBTR.)

Figure 14 shows how the signal-processing sections of Figure 13 are interconnected to extend the upper limit of the $y_k$ sum from 3 to a number as great as 15 on a single chip.

Figure 14. Total Signal-Processing Device

The functions which we have discussed above are capable of mechanizing signal-processing requirements such as:

• Vector-matrix operator with fixed or variable coefficients: 4 x 4, or 2 x 8, or 1 x 8 and 2 x 4, or 1 x 12 and 1 x 4, or 1 x 16 (dimensions may be raised by chaining with other devices)

• Generalized fast-generalized-transform operator (decimate in time or decimate in sequency)

• Digital filter (up to 4)
fixed parameter, variable parameter, or adaptive
denominator order, D ∈ {0, 15}
numerator order, N ∈ {0, 15-D}

• Set of 4 modulators/demodulators

• Multiplexers
4-4:1 or 2-8:1 or 1-8:1 and 2-4:1 or 1-12:1 and 1-4:1 or 1-16:1

• Image-processing functions such as sliding windows

3.6  Problems  and  Retrenchment 

A  preliminary  transistor  count  revealed  that  this  very  desirable  structure 
would  be  an  extremely  ambitious  circuit  with  88,000  transistors.  The  device 
design  could  not  be  completed  under  the  contract. 

A second, less ambitious, structure which was a nine-element vector multiplier was pursued into the design and layout stages. That design, which became identified as the PIPE device, is described in the following chapter.

4. The Nine-Element Vector Multiplier
4.1 Overall Description of PIPE Device

The PIPE provides, every nine 10-ns clock periods, the sum of products

$$y = \sum_{n=1}^{9} a_n x_n$$

where each a and x is an 8-bit 2's-complement number, thereby performing a true multiply-and-accumulate function $10^8$ times per second. The full 19-bit product is available as an output, which permits the devices to be combined to perform higher-accuracy computations. The coefficient a's are parallel loaded and stored on-chip, while the data x's may be loaded serially or in parallel in a fashion which makes the chip directly usable as an FIR (finite-impulse-response) filter described by the transfer function

$$G(z) = \sum_{n=0}^{8} a_n z^{-n}$$


Any  number  of  such  chips  can  be  chained  together  to  form  a  longer  filter. 
Only  an  external  summing  means  is  required  to  accumulate  the  final  result. 
The  device  description  is  given  below: 

Supply  Voltage:  5  volts 
CMOS/SOS 2 µm technology using static logic
Clock Frequency: 100 MHz
Operating temperature range: -55°C to +125°C
Packaging: leadless hermetic chip carrier corresponding
to JEDEC specification

Input  specification:  there  are  7  input  formats;  all  input 
patterns  must  occur  within  9  clock  periods. 

These  formats  are: 

1.  single  parallel  input  applied  9  (or  fewer) 
consecutive  times. 

2.  single  parallel  input  applied  once 

3.  3  parallel  inputs  applied  3  times 

4.  3  parallel  inputs  applied  once 


5.  single  serial  input  applied  once 

6.  three  serial  inputs  applied  once 

7.  nine  serial  inputs  applied  once 

The  input  data  word  is  in  2's-complement,  8-bit 
format.  The  format  is  controlled  by  a  3-bit 
format-control  line.  The  input  section  has  28 
pins  (3x8,  1  out,  3  control). 

Output  specification: 

Single  10-bit  output  tristate  bus  for  full  accuracy. 

Least  significant  10  bits  available  immediately. 

Flag  indicates  when  the  10  most  significant  bits  are  ready. 

Most  significant  bits  on  output  bus  in  response  to  external  strobe. 
The output section has 14 pins (10 out, 1 flag for LSBs ready, 1 flag for MSBs ready, 1 MSB strobe, 1 reset).

Coefficient  specification: 

Single  8-bit  input  bus  for  one-at-a-time  parallel  loading  of  8-bit 
2's-complement  coefficient. 

Separate  4-bit  input-address  identification. 

One  load  control,  one  memory-write  control. 

The  coefficient  section  has  14  pins. 

The  pins  required  by  the  PIPE  device  are  given  below: 

24  for  3  parallel  8-bit  input  data  words 

1  for  serial  data-line  out 

3  for  input-data  format 

10  for  parallel  output  data 

2  for  output  flags 

2  for  output  control 
8  for  parallel  coefficient  word  in 

4  for  coefficient  address 
2  for  coefficient  control 
1  for  word  timing 

1  for  clock 
1  for  power 
1  for  ground 
60  pins  committed 
Recall that the PIPE forms the product

$$y = \sum_{n=1}^{9} a_n x_n$$

where the $x_n$ are input data words, the $a_n$ are stored coefficients, and y is the sum of the products. Both the $a_n$ and the $x_n$ are 8-bit, 2's-complement numbers; $x_n$ is composed of the bits $\{b_{nm}\}$ where m = 0, 1, 2, ... 7. In order to easily describe the computational approaches which were considered, let us examine the array of bits which form the $\{x_n\}$:

x1:  b10  b11  b12  b13  b14  b15  b16  b17
x2:  b20  b21  b22  b23  b24  b25  b26  b27
x3:  b30  b31  b32  b33  b34  b35  b36  b37
x4:  b40  b41  b42  b43  b44  b45  b46  b47
x5:  b50  b51  b52  b53  b54  b55  b56  b57
x6:  b60  b61  b62  b63  b64  b65  b66  b67
x7:  b70  b71  b72  b73  b74  b75  b76  b77
x8:  b80  b81  b82  b83  b84  b85  b86  b87
x9:  b90  b91  b92  b93  b94  b95  b96  b97

Figure 15. Data-Bit Array
The combining of this array of bits with the "a" coefficients is the process
by  which  the  desired  result,  y,  is  obtained.  There  exist,  however,  several 
diverse  means  by  which  this  combining  may  be  accomplished.  One  means  uses 
the  array  of  bits  in  a  row-by-row  fashion.  Each  individual  row  is  multiplied 
by  the  corresponding  "a"  coefficient  and  the  results  summed  in  an  accumulator 
with  the  results  of  the  previously  executed  products.  This  standard  lumped- 
arithmetic  approach  requires  9  full  multiplications  and  8  adds  into  the  accumulator. 
One can show that 4BAAT, 5BAAT, etc., algorithms reduce to the 3BAAT cases for binary multiplication. Some improvement, but not enough.

In our quest for greater computational efficiency, we shall approach a second means which uses the array of bits of Figure 15, not in a row-by-row fashion, but in a column-by-column fashion. Interestingly, the set of bits in a column is used as a memory address. This is an adaptation of the candidate element architecture which was discussed earlier.

The contents at that memory address are summed in an accumulator with one-half the previous results. This procedure requires 8 table-lookup operations and 8 adds into the accumulator, not to form each $y_n$, but to form the total result, y. The advantages and efficiency of this latter distributed-arithmetic method are obviously great. The following paragraphs describe the means of computation in detail.


4.3 The Mechanics of the Algorithm

Here's how it works. Each $x_n$ is composed of the 8 bits, $b_{nm}$, which combine as follows to establish the value of $x_n$:

$$x_n = -b_{n0} + \sum_{m=1}^{7} b_{nm}\, 2^{-m} \qquad (2)$$

The sign bit, $b_{n0}$, is unity if $x_n$ is a negative number, and is zero otherwise. Now, since

$$x_n = \tfrac{1}{2}\left[x_n - (-x_n)\right] \qquad (3)$$

we may then express (2) as:

$$x_n = \tfrac{1}{2}\left[-(b_{n0} - \bar{b}_{n0}) + \sum_{m=1}^{7} (b_{nm} - \bar{b}_{nm})\, 2^{-m} - 2^{-7}\right] \qquad (4)$$
Substituting (4) into (1) yields the following expression:

$$y = \tfrac{1}{2}\left\{ -2^{-7}\left[\sum_{n=1}^{9} a_n\right] + \sum_{m=1}^{7}\left[\sum_{n=1}^{9} a_n (b_{nm} - \bar{b}_{nm})\right] 2^{-m} - \sum_{n=1}^{9} a_n (b_{n0} - \bar{b}_{n0}) \right\} \qquad (5)$$

The three terms of (5) are, in order, the initial condition, the partial product, and the sign correction (the m=0 term).


The possible value of each $b_{nm}$ is either 0 or 1, hence the possible value of each term $(b_{nm} - \bar{b}_{nm})$ is ±1. The bracketed term within the "partial product" braces of (5) can take on a total of $2^9$ possible values, but all entries appear twice (once with each sign); hence, there are only $\tfrac{1}{2} \cdot 2^9 = 256$ distinct values.

If the "a" coefficients are each 8 bits in length, then each of the 256 values will be stored with an accuracy of $8 + \lceil \log_2 9 \rceil = 12$ bits. The "sign correction" values as well as the bracketed part of the "initial condition" value happen to be among the 256 distinct values. Certainly, one valid approach to computing (5) is to use a table-lookup operation in which during the first clock period we form the first partial result

$$r_1 = \left[-\sum_{n=1}^{9} a_n\right] + \left[\sum_{n=1}^{9} a_n (b_{n7} - \bar{b}_{n7})\right] \qquad (6)$$

Both bracketed quantities were obtained from the coefficient memory.

During the 2nd through 7th periods we form the 2nd through 7th partial results

$$r_p = \tfrac{1}{2}\, r_{p-1} + \left[\sum_{n=1}^{9} a_n (b_{n,8-p} - \bar{b}_{n,8-p})\right] \qquad (7)$$

where p = 2 through 7. During the 8th clock period we form the final result

$$2y = r_8 = \tfrac{1}{2}\, r_7 - \sum_{n=1}^{9} a_n (b_{n0} - \bar{b}_{n0})$$


During  the  9th  clock  period  the  result  is  transferred  to  the  output 
register,  the  circuits  are  reinitialized,  and  we  are  ready  to  begin 
another  cycle  of  computing  y.  A  block  diagram  of  this  structure  is 
shown  in  Figure  16. 


Figure 16. Basic PIPE Architecture
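The eight-period recurrence of (6) and (7) can be sketched in Python as follows. This is a minimal model under stated assumptions: it uses exact fractions, it keeps the full 512-entry table rather than the 256 stored values, and all function names are hypothetical.

```python
from fractions import Fraction

def bit(word, m):
    """Bit b_m of an 8-bit two's-complement word; m = 0 is the sign bit, m = 7 the LSB."""
    return ((word & 0xFF) >> (7 - m)) & 1

def build_table(a):
    """table[addr] = sum of a_n * (b_n - complement of b_n); bit k of addr is the bit for a[k]."""
    n = len(a)
    return [
        sum(a[k] if (addr >> k) & 1 else -a[k] for k in range(n))
        for addr in range(2 ** n)
    ]

def da_inner_product(coeff_words, data_words):
    """Evaluate y = sum a_n x_n with 8 lookups and 8 additions, following (6) and (7)."""
    a = [Fraction(c, 128) for c in coeff_words]
    table = build_table(a)

    def lookup(m):
        addr = sum(bit(w, m) << k for k, w in enumerate(data_words))
        return table[addr]

    r = table[0] + lookup(7)        # (6): table[0] is -sum(a_n), the initial condition
    for p in range(2, 8):           # (7): halve, then add the next partial product
        r = r / 2 + lookup(8 - p)
    r = r / 2 - lookup(0)           # 8th period: sign correction, giving 2y
    return r / 2

def direct_inner_product(coeff_words, data_words):
    """Reference value: sum of products of the 8-bit fractions."""
    return sum(Fraction(c * x, 2 ** 14) for c, x in zip(coeff_words, data_words))
```

Both functions take the coefficients and data as 8-bit integers (fractions scaled by $2^7$), so `da_inner_product(coeffs, data)` can be compared directly with `direct_inner_product(coeffs, data)`.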


4.4 Modifying the Basic Algorithm

A decision was made to design the chip so that the coefficients
could be changed. In order to be able to change coefficients during
computation, two memory sets would be required: one being the present
working memory containing functions of the "old" coefficients, the
other being the memory into which we load functions of the new
coefficients. Unfortunately, the resulting 2 x 256 word x 12 bit/word
= 6144-bit high-speed memory was not practical to implement.

Two simplifying steps were taken. First, a restriction was established
such that the coefficients could not be changed during computation.
This halved the memory requirement. Secondly, we partitioned $y = y_1 + y_2$
so that

$$y_1 = \sum_{n=1}^{4} a_n x_n, \qquad y_2 = \sum_{n=5}^{9} a_n x_n$$

Consequently,

$$2y_1 = -2^{-7} \sum_{n=1}^{4} a_n + \sum_{m=1}^{7} \Big[\sum_{n=1}^{4} a_n (b_{nm} - \bar{b}_{nm})\Big] 2^{-m} - \sum_{n=1}^{4} a_n (b_{n0} - \bar{b}_{n0})$$

$$2y_2 = -2^{-7} \sum_{n=5}^{9} a_n + \sum_{m=1}^{7} \Big[\sum_{n=5}^{9} a_n (b_{nm} - \bar{b}_{nm})\Big] 2^{-m} - \sum_{n=5}^{9} a_n (b_{n0} - \bar{b}_{n0})$$


Now for $y_1$ we need only a $\tfrac{1}{2}\cdot 2^4 = 8$-word partial-product memory, and for
$y_2$ we need only a $\tfrac{1}{2}\cdot 2^5 = 16$-word partial-product memory, a reduction
from 256 stored words to 24.
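The effect of the partition on memory size can be sketched in Python. The choice of which half of each table to keep (the half in which the bit of the group's last coefficient is 0) and the coefficient values are assumptions made for illustration.

```python
def stored_half(a):
    """Entries kept in memory for the coefficient group a: those whose address has
    the bit of the last coefficient equal to 0. Bit k of addr selects +a[k] or -a[k]."""
    n = len(a)
    return [
        -a[n - 1] + sum(a[k] if (addr >> k) & 1 else -a[k] for k in range(n - 1))
        for addr in range(2 ** (n - 1))
    ]

# Hypothetical coefficients a_1..a_9 as integers scaled by 2^7.
a = [93, -17, 64, 5, -127, 127, -40, 12, 77]
memory_a = stored_half(a[:4])   # serves y1 (a_1..a_4)
memory_b = stored_half(a[4:])   # serves y2 (a_5..a_9)

words_before = 256                            # single 9-coefficient table
words_after = len(memory_a) + len(memory_b)   # two small tables
```

With the partition the stored-word count depends on $2^3 + 2^4$ rather than $2^8$, at the cost of a second accumulator and one extra addition to combine $y_1$ and $y_2$.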


The 16 possible values of $\sum_{n=1}^{4} a_n (b_{nm} - \bar{b}_{nm})$ are $\pm A_0$ through $\pm A_7$ as
shown in Table 8; the 32 possible values of $\sum_{n=5}^{9} a_n (b_{nm} - \bar{b}_{nm})$ are $\pm B_0$
through $\pm B_{15}$ as shown in Table 9.

TABLE 8. "A" PARTIAL PRODUCT MEMORY CONTENTS AND ADDRESSING

         ADDRESS
    b4m  b3m  b2m  b1m     "A" PARTIAL PRODUCT MEMORY
     0    0    0    0      -(a4 + a3) - (a2 + a1) = -A0
     0    0    0    1      -(a4 + a3) - (a2 - a1) = -A1
     0    0    1    0      -(a4 + a3) + (a2 - a1) = -A2
     0    0    1    1      -(a4 + a3) + (a2 + a1) = -A3
     0    1    0    0      -(a4 - a3) - (a2 + a1) = -A4
     0    1    0    1      -(a4 - a3) - (a2 - a1) = -A5
     0    1    1    0      -(a4 - a3) + (a2 - a1) = -A6
     0    1    1    1      -(a4 - a3) + (a2 + a1) = -A7
     1    0    0    0      +(a4 - a3) - (a2 + a1) = +A7
     1    0    0    1      +(a4 - a3) - (a2 - a1) = +A6
     1    0    1    0      +(a4 - a3) + (a2 - a1) = +A5
     1    0    1    1      +(a4 - a3) + (a2 + a1) = +A4
     1    1    0    0      +(a4 + a3) - (a2 + a1) = +A3
     1    1    0    1      +(a4 + a3) - (a2 - a1) = +A2
     1    1    1    0      +(a4 + a3) + (a2 - a1) = +A1
     1    1    1    1      +(a4 + a3) + (a2 + a1) = +A0

TABLE 9. "B" PARTIAL PRODUCT MEMORY CONTENTS AND ADDRESSING

4.5 Mechanization of the Modified Algorithm

The PIPE consists of six basic sections: the coefficient input section,
the partial-product memory section, the data input section, the arithmetic
section, the output section, and the timing and control section.

4.5.1  Coefficient  Input  Section 

The coefficients (the a's) of the defining equations of the PIPE are
each 8-bit signed fractional numbers such that $-1 < a < 1$, the maximum value
of which is $1 - 2^{-7}$. The partial products (partial-product-memory contents)
for $y_1$ are from the set of numbers $A_0$ through $A_7$, the largest of whose
values can be as great as $4(1 - 2^{-7})$. Similarly, the partial products for $y_2$,
from the set of numbers $B_0$ through $B_{15}$, can have values as great as

$$5(1 - 2^{-7}) = 2^3\,[\,2^{-1} + 2^{-3} - 2^{-8} - 2^{-10}\,]$$

an 11-bit number. We shall now examine the procedures by which these
numbers are generated.

The  coefficient-input  section  is  functionally  diagrammed  in  Figure  17. 

There are nine 8-bit-wide parallel registers, designated $a_1$ through $a_9$, into
which the coefficients are loaded. Their inputs are wired in parallel to
a common 8-bit coefficient-input bus. When the "read input" line is true,
the number which is present on the 8-bit coefficient bus is loaded into the
input register which was identified by the pattern on the "input-address"
lines. The loading of these coefficients is completely independent of the
functioning of the arithmetic section of the PIPE. However, after all the
coefficients which are to be loaded into the PIPE have been loaded (by the
procedure which was described above), the PIPE is asked to "digest"
these values. When the "load memory" input is true, a sequence of events
is initiated which is indicated in the timing diagram of Figure 18. First,
coefficient registers $a_1$ through $a_4$ parallel-load shift registers 1
through 4. The outputs of the shift registers are clocked through
complementers to generate the negatives of the numbers. Each complementer
consists of an inverter followed by a single-input serial adder with a
"1" preset in its carry flip-flop. The 4 streams of serial data from
the 4 complementers pass through the adder tree to generate $-A_0$ through
$-A_7$. The last bit of the serial data is held at the output of each shift
register for 3 additional clock periods in order to spread the sign and to
drive the adder tree with 11-bit-long data streams. This is a necessary
step to ensure sign correction. By forming negative values rather than
positive values, we can avoid some carry-propagation problems later.
Immediately after the formation of $-A_0$ through $-A_7$, registers $a_5$ through
$a_9$ parallel-load shift registers 1 through 5, and the process which
was described to generate $-A_0$ through $-A_7$ is repeated in order to
generate $-B_0$ through $-B_{15}$, but using 5 data streams rather than 4.

Figure 18. Coefficient Input Timing Diagram
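The complementer and the sign spreading described above can be sketched in Python. The sketch assumes the serial data are clocked least-significant bit first; that ordering and the function names are assumptions, not statements taken from the design.

```python
def serial_stream(word, width=8, extend=3):
    """LSB-first bit stream of a two's-complement word, with the last (sign) bit
    held for `extend` additional clock periods."""
    u = word & ((1 << width) - 1)
    bits = [(u >> i) & 1 for i in range(width)]
    return bits + [bits[-1]] * extend

def complementer(stream):
    """Inverter followed by a single-input serial adder whose carry flip-flop is preset to 1."""
    carry = 1
    out = []
    for b in stream:
        s = (1 - b) + carry      # inverted input bit plus stored carry
        out.append(s & 1)        # sum output
        carry = s >> 1           # carry held for the next clock period
    return out

def stream_value(stream):
    """Interpret an LSB-first stream as a two's-complement integer."""
    u = sum(b << i for i, b in enumerate(stream))
    return u - (1 << len(stream)) if stream[-1] else u
```

For any 8-bit word `w`, `stream_value(complementer(serial_stream(w)))` returns `-w`. The three extra sign bits matter for the most negative word, whose negative does not fit in 8 bits.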

While the PIPE is loading its coefficients as described above, the output
from the arithmetic section is not valid; therefore, the output
register (which is discussed in section 4.5.5) is forced to a RESET state.
After the coefficient digestion has occurred, the coefficient load-control
logic pauses until the output from the arithmetic section is again valid.
At that time the output register is restored to normal operation.

4.5.2 Partial-Product-Memory Section

The  organization  of  the  memory  section  is  shown  in  Figure  19. 

Eight serial data streams, $-A_0$ through $-A_7$, are serially clocked from
the adder tree into partial-product memory A, thereby loading the upper
half of Table 8 into the memory. Each of these values, $-A_0$ through $-A_7$,
although loaded serially, will be read out in parallel onto a tristate bus
which feeds the input register of arithmetic section A (which is discussed
below). The addressing means for reading out these coefficients is also
discussed below.

Similarly, sixteen data streams, $-B_0$ through $-B_{15}$, are serially clocked
from the adder tree into partial-product memory B, thereby loading the
upper half of Table 9 into the memory. Again, each of these values will
be read out in parallel onto a tristate bus which feeds the input register
of arithmetic section B.

Figure 19. Memory section: coefficient memory A (8 words at 11 bits) and coefficient memory B (16 words at 11 bits)


The lower half of Table 8 or 9 is effectively read into the arithmetic
section by complementing its opposite-signed counterpart; e.g., $+A_0$ is
entered by reading $-A_0$ from the memory, inverting each bit, and adding
"1" through the carry input of the arithmetic section. This will be
discussed later in greater detail.

4.5.3  Data  Input  Section 

The  data  input  section  of  Figure  20  performs  three  functions. 

The first function is the loading of the input data words into the
appropriate shift registers, sometimes directly, and sometimes via the
companion parallel register. In control state 0, no inputs are accepted.
In control state 1, 9 parallel input words are accepted in succession.
Switches SW1 and SW2 are closed and the load-mode-select logic provides
a "read" signal to the parallel registers, 1 through 9, in succession.

In control state 2, switches SW1 and SW2 are open. The load-mode-select
logic provides a "read" signal to parallel registers 1, 4 and 7
simultaneously; then to parallel registers 2, 5 and 8 simultaneously;
then to parallel registers 3, 6 and 9 simultaneously. In control state 3,
switches 1, 2, 5 and 8 are open, and switches 3, 4, 6, 7, 9 and 10 are closed.
After parallel loading registers 1, 4 and 7, the data are shifted. The
data which are loaded into register 1 flow sequentially through shift
registers 1, 2 and 3; the data which are loaded into register 4 flow
sequentially through shift registers 4, 5 and 6; the data which are loaded
into register 7 flow sequentially through shift registers 7, 8 and 9.

In control state 4, switches 1 and 2 are open while all other switches
are closed. After the input is parallel loaded into register 1, it is
permitted to flow sequentially through all 9 shift registers. In control
state 5, serial input data flow in through the SI port. Switches 1 and 2
remain open, switches 3 through 10 remain closed, and the data freely
flow through shift registers 1 through 9 and out through the SO port.

In control state 6, the data are serially loaded into all 9 shift registers
simultaneously. The input to the first shift register is through port SI.
The second through ninth shift registers are loaded via the 8-bit bus and
switches 3 through 10 as shown in Figure 20. In control state 7, three
serial data words are loaded. The first data word is passed sequentially
through the first three shift registers via input port SI and switches 3
and 4. The second word is passed sequentially through the second three
shift registers via switches 5, 6 and 7, and the third data word is similarly
passed through the last three registers via switches 8, 9 and 10.

The  purpose  of  the  outputs  of  the  registers  is  described  in  Tables  8  and  9. 

The outputs of the first 4 shift registers drive a 1-out-of-8 decoder.
Forming the EXOR of $b_{4m}$ with each of $b_{1m}$, $b_{2m}$ and $b_{3m}$ effectively folds the
lower half of Table 8 onto the upper half. Similarly, the outputs of the last
5 shift registers drive a 1-out-of-16 decoder. Forming the EXOR of $b_{9m}$ with
each of $b_{5m}$, $b_{6m}$, $b_{7m}$ and $b_{8m}$ effectively folds the lower half of Table 9
onto the upper half. The outputs of the 4th and 9th shift registers also control
the add/subtract functions of parallel adders A and B, respectively. By this
means the adders select the appropriately signed inputs.
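The folding of Table 8 by the EXOR operation can be sketched in Python. The address convention (address k = 4·b3 + 2·b2 + b1) and the function names are assumptions made for this illustration.

```python
def stored_upper_half(a1, a2, a3, a4):
    """stored[k] holds -A_k, the Table 8 entry for b4 = 0 and k = 4*b3 + 2*b2 + b1."""
    stored = []
    for k in range(8):
        b1, b2, b3 = k & 1, (k >> 1) & 1, (k >> 2) & 1
        stored.append(-a4 + (a3 if b3 else -a3) + (a2 if b2 else -a2) + (a1 if b1 else -a1))
    return stored

def folded_lookup(stored, b1, b2, b3, b4):
    """EXOR each address bit with b4, then let b4 choose add (0) or subtract (1)."""
    addr = ((b3 ^ b4) << 2) | ((b2 ^ b4) << 1) | (b1 ^ b4)
    return -stored[addr] if b4 else stored[addr]

def direct_value(a, bits):
    """Unfolded definition: sum of a_n * (b_n - complement of b_n)."""
    return sum(an if b else -an for an, b in zip(a, bits))
```

For every one of the 16 bit patterns, `folded_lookup` returns the same value as `direct_value`, although only the 8 upper-half entries are stored.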

4.5.4  Arithmetic  Section 

We now examine the arithmetic section, which is composed of two adder/
subtracter sections.

Refer to Figure 21. MUX 1 of the "A" adder/subtracter accepts $-A_0$ through $-A_7$,
which are the partial-product inputs from partial-product memory A, and selects
the appropriately signed inputs. The 11-bit input is converted to a 12-bit
input by sign spreading; i.e., the 11th and 12th bits are strapped together.
MUX 2 selects either the initial condition from the initial-condition register
or the previous result divided by 2. The LSB of the previous sum is gated into
a carry generator, so no accuracy is lost. Similarly, MUX 1 of the "B" adder/
subtracter accepts inputs $-B_0$ through $-B_{15}$, which are the inputs from partial-
product memory B, and selects the appropriately signed inputs; MUX 2 of the "B"
adder/subtracter selects the initial condition or the previous result divided by 2.


Figure 21. Arithmetic and Output Sections


Since the values which are stored in the partial-product memories are $-A_0$
through $-A_7$ and $-B_0$ through $-B_{15}$, when the negative value is required, the
memory content is simply gated directly through MUX 1. When the positive
value is required, the memory content is inverted by complementing all bits,
and a "1" is added through the carry input to the parallel adder.
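The sign selection at the adder input can be sketched in Python for a 12-bit word. The function names are hypothetical, and the model covers only the invert-and-carry mechanism, not the multiplexers or the halving path.

```python
MASK12 = 0xFFF

def to_signed12(u):
    """Interpret the low 12 bits of u as a two's-complement integer."""
    u &= MASK12
    return u - 0x1000 if u & 0x800 else u

def add_partial_product(acc, stored_negative, want_positive):
    """Add a memory word to acc. The memory holds the negative value; when the
    positive value is wanted, every bit is inverted and 1 enters at the carry input."""
    operand = stored_negative & MASK12
    carry_in = 0
    if want_positive:
        operand ^= MASK12
        carry_in = 1
    return to_signed12((acc & MASK12) + operand + carry_in)
```

For example, with the stored word -37, `add_partial_product(100, -37, False)` gives 63 and `add_partial_product(100, -37, True)` gives 137.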

After $2y_1$ and $2y_2$ are formed, the final result is obtained by gating $\tfrac{1}{2}(2y_1)$
through MUX 1 and $\tfrac{1}{2}(2y_2)$ through MUX 2 of the "B" adder/subtracter. The
final carry is provided by the carry generator, which sums the LSB's from
both $2y_1$ and $2y_2$. The most-significant 10-bit part of y is then gated
into a holding register. The remaining 10 least-significant bits of the
full-precision answer are also available. Let us see why we are interested in so many
bits when the input data word and the coefficient are each permitted a word
length of only 8 bits, including sign.


The product of a pair of 2's-complement 8-bit numbers is 15 bits in length. The
sum of 5 15-bit numbers is $15 + \lceil \log_2 5 \rceil = 18$ bits in length; the sum of 9 15-bit
numbers is $15 + \lceil \log_2 9 \rceil = 19$ bits in length.
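These word lengths can be checked with a short Python sketch. It treats the 8-bit fractions as integers scaled by $2^7$ and leaves out the single product (-128)·(-128), which does not arise when the coefficients are restricted to magnitudes below 1; the helper names are hypothetical.

```python
import math

def signed_bits(value):
    """Smallest two's-complement word length that can hold the integer value."""
    magnitude = value if value >= 0 else -value - 1
    return magnitude.bit_length() + 1

def sum_bits(term_bits, n_terms):
    """Word length after adding n_terms words of term_bits bits each."""
    return term_bits + math.ceil(math.log2(n_terms))

# Extreme products of two 8-bit integers, leaving out (-128) * (-128).
largest = 127 * 127
smallest = -128 * 127
```

Here `signed_bits(largest)` and `signed_bits(smallest)` are both 15, while `signed_bits(5 * smallest)` and `signed_bits(9 * smallest)` agree with `sum_bits(15, 5)` and `sum_bits(15, 9)`.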


If we consider the input data words (the $x_n$) as fractions, their format will be
the same as the format of the input coefficients. The value of $y_1$ can be as great
as $2^2 (1 - 2^{-7})^2$; the value of $y_2$ can be as great as $(1 + 2^2)(1 - 2^{-7})^2$. Since the
values which are actually formed by the adders are $2y_1$ and $2y_2$, their maximum
values in binary format are

    2y1 max = 0 0111.111000 0000100
    2y2 max = 0 1001.110110 0000101

where the leftmost bit is the sign, the next four bits are the 4-bit integer value, and the lower seven bits are formed in the delay register.

During the formation of $y = \tfrac{1}{2}(2y_1 + 2y_2) \le 9(1 - 2^{-7})^2$ we have the most
significant 11 bits of y in the MSB register with a redundant sign bit,
for a total of 12 bits. The 2 LSB's are combined with the 8-bit carry-
generator result. The answer is provided as a pair of 10-bit words, the
10 MSB's and the 10 LSB's.

The carry generator is a simple 3-input serial adder with the usual sum
and carry outputs. During each of the 8 computational cycles each
parallel adder sends the least-significant bit of its output to the carry
generator. The carry generator in turn sends its "sum out" to an 8-bit
shift register. The contents of the shift register are placed in the 8
LSB locations of the 10-bit output register. The other 2 bits which go
into this register are the 2 LSB's of the MSB half of y. While the carry
generator is providing its 8th "sum out" value, it is also providing a
carry input to parallel adder B, which is forming its final answer
$y = \tfrac{1}{2}(2y_1 + 2y_2)$. A computing sequence is diagrammed in Figure 22.
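One reading of the carry-generator description can be sketched in Python. The integer lists `pp_a` and `pp_b` stand for the eight words that each parallel adder receives in clock order; the initial condition and the sign correction are assumed to be folded into the first and last entries. These names and that folding are assumptions of the sketch.

```python
def accumulate_with_carry_generator(pp_a, pp_b):
    """Two truncating accumulators plus a 3-input serial adder that collects their LSBs.
    Returns (msb_part, lsb_part) of the combined result."""
    half_a = half_b = 0     # previous sums divided by 2 (LSB dropped)
    carry = 0               # carry flip-flop of the serial adder
    lsb_register = []       # "sum out" bits, least significant first
    for p_a, p_b in zip(pp_a, pp_b):
        sum_a = half_a + p_a
        sum_b = half_b + p_b
        t = (sum_a & 1) + (sum_b & 1) + carry   # two LSBs plus the stored carry
        lsb_register.append(t & 1)
        carry = t >> 1
        half_a, half_b = sum_a >> 1, sum_b >> 1
    msb_part = half_a + half_b + carry          # final add, with the last carry
    lsb_part = sum(bit << i for i, bit in enumerate(lsb_register))
    return msb_part, lsb_part

def exact_sum(pp_a, pp_b):
    """Full-precision reference: cycle i contributes with weight 2^i."""
    return sum((pa + pb) << i for i, (pa, pb) in enumerate(zip(pp_a, pp_b)))
```

For eight cycles, `msb_part * 256 + lsb_part` equals `exact_sum(pp_a, pp_b)`, which is the sense in which no accuracy is lost when each adder drops its LSB.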

4.5.5  Output  Section 

The output section, which is also shown in Figure 21, consists of a 10-bit
output register which drives a tristate bus, a 10-bit MSB register, a 10-bit
LSB register, and the associated control and switching functions. The 10
LSB's are derived from two sources: the 8 LSB's are clocked from the carry
generator into the 8-bit LSB serial/parallel register and held; this is augmented
by another pair of bits which are the 2 LSB's from the final output of adder
B. This 10-bit result is then placed automatically into the output register.

The 10 MSB's may be read into the output register by making the READ MSB line
true. At any time, the output register may be cleared and the tristate bus
allowed to float by making the RESET line true. The MSB's will not be read
except upon command, but once every computational cycle the LSB's are read.
Reading of the LSB's can be inhibited by holding the RESET line true.


A forced RESET occurs when a new set of coefficients is being digested. This
RESET will be held until the output of the arithmetic section is again valid.

Figure 22. Computing Sequence (rows, in order: initial condition; first through eighth partial products, each with a carry in if the partial product is positive; first through seventh partial sums; sums $y_1$ and $y_2$, with MSB's and with LSB's to the carry generator; carry from the carry generator; sum y, with MSB's and with LSB's from the shift register; MSB's output and LSB's output)

4.5.6  Timing  and  Control  Section 

Figure  23  shows  the  timing  of  all  the  arithmetic  operations.  The 
timing  requirements  are  extremely  simple.  Simple  delay,  invert,  and  NAND 
operations  on  the  input-word-timing  pulse  can  provide  all  necessary  internal 
timing  signals  for  the  arithmetic  section  as  shown  in  Figure  24. 

4.5.7  Overall  Chip  Structure 

Figure  25  illustrates  the  entire  chip  and  the  relationship  between  the 
various  parts  which  were  described  above.  The  complete  set  of  logic  diagrams 
which  were  developed  is  in  the  Appendix. 


4.5.8 Using Multiple Devices for High Accuracy

The high-accuracy computation mode can be understood by considering
a set of 15-bit inputs which can be expressed as the following:

$$x_n = -b_{n0} + \sum_{m=1}^{14} b_{nm}\,2^{-m} = \Big[-b_{n0} + \sum_{m=1}^{7} b_{nm}\,2^{-m}\Big] + \Big[0 + \sum_{m=1}^{7} b_{n,m+7}\,2^{-m}\Big] 2^{-7} = x_{Mn} + x_{Ln}\,2^{-7}$$

where $x_{Mn}$ are the most significant bits of $x_n$ and $x_{Ln}$ are the least significant
bits. We can similarly express the coefficients as $a_n = a_{Mn} + a_{Ln}\,2^{-7}$, so the sum
of the products can be written as

$$y = \sum_{n=1}^{9} a_n x_n = \sum_{n=1}^{9} (a_{Mn} + a_{Ln}\,2^{-7})(x_{Mn} + x_{Ln}\,2^{-7})$$

$$= \sum_{n=1}^{9} a_{Mn} x_{Mn} + \Big[\sum_{n=1}^{9} a_{Ln} x_{Mn} + \sum_{n=1}^{9} a_{Mn} x_{Ln}\Big] 2^{-7} + \Big[\sum_{n=1}^{9} a_{Ln} x_{Ln}\Big] 2^{-14}$$
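The decomposition into most-significant and least-significant halves can be checked with a Python sketch that works on integers (15-bit fractions scaled by $2^{14}$). Each call to `inner` stands for one 9-element inner product; the function names and the integer scaling are assumptions of the sketch.

```python
def split15(word):
    """Split a 15-bit two's-complement integer so that word = m * 2**7 + l,
    with m a signed 8-bit value and l a non-negative 7-bit value."""
    return word >> 7, word & 0x7F

def inner(u, v):
    """Plain inner product of two equal-length integer sequences."""
    return sum(p * q for p, q in zip(u, v))

def high_accuracy_inner_product(a_words, x_words):
    """Combine four short inner products into the full-precision result."""
    a_m, a_l = zip(*(split15(w) for w in a_words))
    x_m, x_l = zip(*(split15(w) for w in x_words))
    mm = inner(a_m, x_m)    # most-significant halves
    lm = inner(a_l, x_m)    # cross term
    ml = inner(a_m, x_l)    # cross term
    ll = inner(a_l, x_l)    # least-significant halves
    # Three additions combine the four results with weights 2^14, 2^7 and 1.
    return (mm << 14) + ((lm + ml) << 7) + ll
```

The returned integer equals the direct sum of products of the 15-bit words, in units of $2^{-28}$.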


Performing the multiplication with an input accuracy of 15 bits and an
output accuracy of 33 bits requires a set of four PIPE devices plus three external
adders.

Figure 23. Timing Diagram for Arithmetic Section

Figure 27. PIPE Integrated Circuit Block Diagram